Covid-19 Trends on the Global Level
Exploring worldwide Covid Data!
Tabitha - Covid Publications Trends and Analysis
During the pandemic, scientists around the globe were putting all of their resources into better understanding Covid-19 and finding a vaccine. This led to an unprecedented level of scientific research, which is reflected in the publications released to document these discoveries. For this analysis, I used data from Dimensions, the world’s largest linked research database. The dataset I used encompasses all publications, datasets, and clinical trials from Dimensions pertaining to Covid-19 from March 2020 to September 2021, and can be accessed here. With such a comprehensive data source, I hoped to explore trends and similarities in Covid-related publications. Specifically, I wanted to analyze which countries contributed the most to Covid research efforts based on their publication counts, and what the most common topics of interest were across all the Covid publications based on their titles.
Publications by Country
To examine publication prevalence by country, I used the country of research origin variable from the dataset to decide which country to assign each publication title to. Many of these publications are collaborations with multiple authors, some from the same country and some from different countries. In those cases, I counted a publication once for each distinct country involved, but never more than once for the same country. For the purpose of this analysis I wasn’t interested in how many authors from each country were published, but rather in how many publications were at least partially published with data from that country; I didn’t want the data skewed by, say, a publication with 5 authors all from the US counting as 5 different publications.
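The counting rule above can be sketched in a few lines of base R. This is only an illustration under assumptions: the `authors` table and its `title` and `country` columns are hypothetical stand-ins, not the actual Dimensions schema or the code used for this post.

```r
# Hypothetical author-level rows: one row per (publication, author country).
# Column names are illustrative, not the Dimensions schema.
authors <- data.frame(
  title   = c("Paper A", "Paper A", "Paper A", "Paper B", "Paper B", "Paper C"),
  country = c("United States", "United States", "China",
              "United Kingdom", "United States", "China"),
  stringsAsFactors = FALSE
)

# Keep each (title, country) pair once, so several authors from the same
# country count as a single publication for that country.
pairs <- unique(authors[, c("title", "country")])

# A collaboration still counts once for every distinct country involved.
pubs_by_country <- sort(table(pairs$country), decreasing = TRUE)
print(pubs_by_country)
```

Here "Paper A" has two US authors and one author from China, so it adds one publication to each of those two countries rather than two to the US.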
Although everyone was working around the clock to find a vaccine, some countries stood out for the number of their contributions. As you can see, in both 2020 and 2021 the United States was the top contributor of Covid-related publications. The US had more than double the number of publications of the second and third most published countries, the United Kingdom and China. Additionally, although there is some movement between the two years, the same 10 countries had the most publications throughout the height of the pandemic.
Most Common Publication Topics
Although the dataset I used on publications was pretty comprehensive, it didn’t include the actual text of the papers. Since many of these papers also weren’t open access, in order to analyze common topics I had to focus on the paper titles. While you would assume that the title of a paper accurately reflects the main topics of the writing, that is not always the case, so this topical analysis is far from perfect in its scope. Also, I removed the typical stop words for this analysis (from the stop_words dataset), but due to the medical and global context, there were some additional stop words I manually removed. These included stop words in other languages such as “de” and “la”, and words redundant to this analysis such as “Covid-19” and “2020”.
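As a rough sketch of the title analysis described above, the base R snippet below splits titles into words, drops stop words, and counts what remains. The titles and both stop word vectors are made up for illustration; `general_stops` is only a tiny stand-in for a full stop word list, and the splitting rule here is an assumption rather than the tokenizer used for the post.

```r
# Hypothetical titles standing in for the Dimensions publication titles.
titles <- c(
  "Clinical characteristics of Covid-19 patients in 2020",
  "Impacto de la pandemia en la salud mental",
  "A review of patients with severe infection"
)

# Tiny stand-in for a general English stop word list, plus the kind of
# manual additions described above (other languages, redundant terms).
general_stops <- c("a", "of", "in", "the", "with")
custom_stops  <- c("de", "la", "en", "covid", "19", "2020")

# Lowercase, split on anything that is not a letter or digit, drop empties.
# Note that this splits "covid-19" into "covid" and "19".
words <- unlist(strsplit(tolower(titles), "[^[:alnum:]]+"))
words <- words[nzchar(words)]

# Remove stop words, then count the remaining words across all titles.
kept <- words[!words %in% c(general_stops, custom_stops)]
word_counts <- sort(table(kept), decreasing = TRUE)
print(head(word_counts, 5))
```

Keeping the manual additions in their own vector makes it easy to see which removals were judgment calls specific to this dataset.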
As you can see, many of the most common topics (based on the publication title) aren’t directly about a vaccine. Many focus on Covid patients and their health, the severity of infection, or the impact of care received. There is also a common theme of learning and relation in these studies, which emphasizes how much of the research effort was exploratory due to how little we knew about the disease and its treatment. It is also important to note that not all of these publications were clinical trials or even strictly medical; rather, they include any and all published papers (on Dimensions) that related to Covid in some way. So not all of them are strictly focused on patient health (although clearly many were), and it seems quite a few are general reviews of the existing literature suggesting possible future areas of study.
Although these visualizations are valuable tools for understanding the publications released on Covid during the height of the pandemic, there are several limitations to the conclusions we can draw from this data due to the manipulations necessary to perform these analyses. For example, all of the publications counted in both figures had to have titles written in the Latin alphabet, which means that English-speaking (or English-publishing) countries are favored in the analysis. Also, publications that didn’t have titles or countries assigned to them in the dataset were excluded from these analyses; this may reflect a flaw in the way the dataset was collected and may disadvantage certain countries.
Lynca
Emma - Covid Recovery Times
For this PUG project we chose to continue to work with COVID data on
a global scale as opposed to just statewide data in Massachusetts. My
goal for this project was to discover what qualities and conditions of
countries were similar between those who took longer to “recover” from
COVID in 2021. To do this I worked with k-means clustering and a handful
of attributes in the OWID COVID dataset. I wanted specifically to use
attributes that did not technically “have anything to do with” COVID or
a pandemic to see how the country at its pre-pandemic level reacted. I
looked for poverty as well as income data in other sources, but it was
either incomplete or not in the years that I wanted so I instead stayed
within the OWID dataset. I narrowed down my variables to population
density, GDP per capita, the cardiovascular death rate, and diabetes
prevalence for each country, as they were both the most complete
variables and the most interesting to me. My next big step was to define
my daystil variable, which would tell me how long it took, in
number of days, for countries to recover. I did some exploring on the
OWID website with their distribution displays and decided to count the
number of days each country took to get to one death per million,
counting days from 2020 but only looking for that point in 2021. I did
not look for it in 2020 for a few reasons: things started out bad and
got worse, and I didn’t want that “started out” portion to be picked as
the point when a country was at one death per million; the summer
allowed for much more outside time and less school, with lowered rates,
so I wanted to evaluate recovery from a winter when people would have
been inside; and countries that were affected later needed to be taken
into consideration. So, I only looked at 2021 and forward for the point
at which countries hit the 1 death per million mark.
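One way the daystil idea could be computed is sketched below. Several pieces are assumptions: the `deaths_per_million` column is a hypothetical name for whichever daily OWID measure was used, "recovered" is read as at or below 1 death per million, and the count start date of 2020-03-10 is chosen only because it maps 1 January 2021 to day 297, the figure quoted later in this post. The data is made up.

```r
# Hypothetical daily series in the shape of the OWID data: one row per
# country and date, with a deaths-per-million measure.
dates <- seq(as.Date("2020-12-30"), as.Date("2021-01-05"), by = "day")
daily <- data.frame(
  location = rep(c("Country A", "Country B"), each = length(dates)),
  date     = rep(dates, times = 2),
  deaths_per_million = c(3.0, 2.5, 2.0, 1.6, 1.2, 0.9, 0.7,   # Country A
                         0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1)   # Country B
)

# Assumption: days are counted from a fixed date in 2020. This date is used
# only because it makes 1 January 2021 equal to day 297.
count_start <- as.Date("2020-03-10")
threshold   <- 1

days_until <- function(df) {
  # Only dates in 2021 or later are eligible as the "recovery" date.
  ok <- df$date >= as.Date("2021-01-01") & df$deaths_per_million <= threshold
  if (!any(ok)) return(NA_real_)   # never reached the threshold
  as.numeric(min(df$date[ok]) - count_start)
}

daystil <- sapply(split(daily, daily$location), days_until)
print(daystil)
```

Country B is already under the threshold at the end of 2020, so it "recovers" on the first eligible day, while Country A only gets there a few days into 2021.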
I used an elbow plot (Figure 1) to pick the number of clusters and ended up with three. Four would also have been fine, but I didn’t think the added complexity of a fourth cluster would contribute much, so I went ahead with three.
Figure 1
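For readers unfamiliar with elbow plots, here is a minimal sketch of how one can be built with base R's `kmeans()`. The data is simulated and the column names simply mirror the OWID variables listed earlier; this is not the code or data behind Figure 1.

```r
set.seed(1)

# Simulated country-level table with the four clustering variables.
n <- 60
countries <- data.frame(
  gdp_per_capita        = c(rnorm(20, 8000, 2000),
                            rnorm(20, 35000, 5000),
                            rnorm(20, 75000, 8000)),
  population_density    = runif(n, 10, 1000),
  cardiovasc_death_rate = runif(n, 100, 400),
  diabetes_prevalence   = runif(n, 3, 12)
)

# Total within-cluster sum of squares for k = 1..8 clusters.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(countries, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing wss by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using `nstart = 20` reruns each fit from several random starts and keeps the best one, which makes the curve less dependent on a single unlucky initialization.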
In terms of my grouping, I ended up with three clusters that were certainly different in terms of GDP per capita, cardiovascular death rate, and population density, but I had pretty close numbers for days until recovery, with about 312, 316, and 360 days in each group (Figure 2). I also noticed that Russia was a huge outlier in terms of days until recovery, so I tried clustering without Russia in the picture. I did the elbow plot again for this too and decided on three clusters as well. The withins (within-cluster sums of squares) are slightly lower for one group in the data without Russia than in the data with Russia, but the change seemed to make more of a difference to the visualization than to the actual statistical results, so I kept Russia in the data (Figure 3). I thought the different clusters were most interesting and easiest to see when plotting GDP per capita against days until recovery.
Figure 2
| Cluster | Cluster Size | Days until Recovery | GDP per Capita | Population Density | Cardiovascular Death Rate | Diabetes Prevalence | Withins |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 312.1111 | 76769.75 | 1010.9996 | 154.3424 | 10.222222 | 2898206069 |
| 2 | 126 | 316.4524 | 7742.67 | 143.2296 | 303.1972 | 7.896111 | 3995840866 |
| 3 | 49 | 359.6122 | 34404.29 | 220.6610 | 187.6149 | 8.176327 | 3997169441 |
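A table shaped like Figure 2 can be assembled from a `kmeans()` fit as sketched below. The data is simulated, only two attributes are used to keep it short, and it assumes the clustering was done on the country attributes with days until recovery summarized afterwards, which the post does not spell out.

```r
set.seed(2)

# Simulated stand-in for the country table; not the OWID values.
n <- 45
countries <- data.frame(
  daystil            = round(runif(n, 297, 420)),
  gdp_per_capita     = c(rnorm(15, 8000, 1500),
                         rnorm(15, 35000, 3000),
                         rnorm(15, 75000, 4000)),
  population_density = runif(n, 10, 1000)
)

# Assumption: cluster on the country attributes only, then summarise daystil.
features <- countries[, c("gdp_per_capita", "population_density")]
fit <- kmeans(features, centers = 3, nstart = 20)

# One row per cluster: mean of every column, cluster size, and the
# within-cluster sum of squares ("withins").
cluster_means <- aggregate(countries, by = list(cluster = fit$cluster), FUN = mean)
summary_table <- cbind(cluster_means, size = fit$size, withins = fit$withinss)
print(summary_table)
```

Because the columns are clustered unscaled here, GDP per capita dominates the distances; calling `scale(features)` before `kmeans()` would weight the variables equally. The large withins in Figure 2 are consistent with unscaled values, though that is an inference and not something stated in the post.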
Figure 3
Finally, I removed all countries that “recovered” on day one of 2021, or in 297 days. Many countries were already under one death per million at that point, and I wondered if their prevalence was distorting my data. I hesitate to make this my entire project, because there are certainly countries that got there and managed to stay in control all on their own, but I think a large number of countries have this statistic because their reporting was harder to get hold of. Results were very similar: I didn’t get large differences between groups in the number of days it took for countries to recover, and the trends with GDP, population density, and my other variables were consistent.
As a final bonus step, I fit a multiple linear model to try to predict days until recovery, picking variables from the same set I used for clustering and ending up with just GDP per capita and population density (Figure 4). It’s not as nuanced as clustering, but it was interesting to compare the more “basic” observations from the model (higher GDP per capita and lower population density are associated with more days until recovery) with what my clusters seemed to communicate. The model explains only about 5% of the variation in days until recovery (R-squared of 0.049), and the population density coefficient is not significant at the 0.05 level (p = 0.067), so these associations are weak. Looking at the clusters (3 clusters, including Russia), I found that the highest GDP per capita and highest population density, as well as the lowest GDP per capita and lowest population density, were grouped with fewer days until recovery, whereas the middle ground in those two respects had more days until recovery. Diabetes prevalence and cardiovascular death rate did not seem to follow any specific pattern of higher or lower values grouping with days until recovery (Figure 2).
Figure 4
##
## Call:
## lm(formula = daystil ~ gdp_per_capita + population_density, data = covidCountries)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -100.76  -26.58  -21.54   -5.80  446.81
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         3.177e+02  6.353e+00  50.003  < 2e-16 ***
## gdp_per_capita      7.124e-04  2.500e-04   2.849  0.00489 **
## population_density -1.416e-02  7.676e-03  -1.845  0.06667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.57 on 181 degrees of freedom
## Multiple R-squared: 0.04864, Adjusted R-squared: 0.03813
## F-statistic: 4.627 on 2 and 181 DF, p-value: 0.01097
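To make the model output more concrete, the short sketch below plugs the cluster means from Figure 2 into the fitted equation from Figure 4. All numbers are copied from those two figures; the variable names in the snippet are just for illustration.

```r
# Coefficients copied from the model summary in Figure 4.
coefs <- c(intercept          = 3.177e+02,
           gdp_per_capita     = 7.124e-04,
           population_density = -1.416e-02)

# Cluster means copied from Figure 2.
clusters <- data.frame(
  cluster            = 1:3,
  observed_days      = c(312.1111, 316.4524, 359.6122),
  gdp_per_capita     = c(76769.75, 7742.67, 34404.29),
  population_density = c(1010.9996, 143.2296, 220.6610)
)

# Linear prediction: intercept + b1 * GDP per capita + b2 * population density.
clusters$predicted_days <- coefs[["intercept"]] +
  coefs[["gdp_per_capita"]] * clusters$gdp_per_capita +
  coefs[["population_density"]] * clusters$population_density

print(clusters[, c("cluster", "observed_days", "predicted_days")])
```

The predictions order the clusters by GDP per capita, while the observed cluster means do not, which is the contrast between the linear model and the clusters discussed above.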
I had high hopes for what my data could and would show, but I think there are some serious limitations to this project. My biggest challenge was figuring out how to define my days until recovery variable. The number that I picked, 1 death per million, seems like a good metric to me, but I think a better number would have been one that took the population density of the country into account. Even though this is a per capita number, which takes into account the actual size of the population, I wonder if population density might guide me to different and more specific metrics. The 2020/2021 choice for when to start counting days was also a little shaky, but I think the reasoning behind it is much more solid than the reasoning behind 1 death per million. Another thing I wish I could have implemented better is the poverty statistics, which I had hoped to cover for all countries (instead of just SIX when I narrowed it down :( ), and something on healthcare. I was unable to find healthcare data for the countries I wanted to predict, but I think it would have been really interesting to classify levels of healthcare coverage and then use that to group countries with days until recovery.
Ultimately I think this is a really interesting concept, and if I had better and more complete data on a wider range of country attributes like housing type, rural vs. urban populations, healthcare coverage, etc. I would definitely have a wider range of things to choose from and probably more interesting results. Were that to exist, this data could help the WHO advise countries on what to expect from a pandemic based on their current situation and how possibly to put themselves in a better position to resist one.